Abstract


Localized operators, like Gabor wavelets and difference-of-Gaussian filters, are considered to be useful
tools  for  image  representation.  This  is  due  to  their  ability  to  form  a  ‘sparse  code’  that  can  serve  as  a 
basis  set  for  high-fidelity  reconstruction  of  natural  images.  However,  for  many  visual  tasks,  the  more 
appropriate  criterion  of  representational  efficacy  is  ‘recognition’,  rather  than  ‘reconstruction’.  It  is 
unclear  whether  simple  local  features  provide  the  stability  necessary  to  subserve  robust  recognition  of 
complex  objects.  In  this  paper,  we  search  the  space  of  two-lobed  differential  operators  for  those  that 
constitute  a  good  representational  code  under  recognition/discrimination  criteria.  We  find  that  a  novel 
operator, which we call the ‘dissociated dipole’, displays useful properties in this regard. We describe
simple  computational  experiments  to  assess  the  merits  of  such  dipoles  relative  to  the  more  traditional 
local  operators.  The  results  suggest  that  non-local  operators  constitute  a  vocabulary  that  is  stable  across 
a  range  of  image  transformations. 
Foundation Fellowship.

Introduction

Information  theory  has  become  a  valuable  tool  for  understanding  the  functional  significance  of  neural 
response  properties.  In  particular,  the  idea  that  a  goal  of  early  sensory  processing  may  be  to  efficiently 
encode  natural  stimuli  has  generated  a  large  body  of  work  describing  the  function  of  the  human  visual 
system  in  terms  of  redundancy  reduction  and  ‘maximum-entropy’  responses  (Attneave  1954;  Barlow 
1961;  Atick  1992;  Field  1994). 

In  the  compound  eye  of  the  fly,  for  example,  the  contrast  response  function  of  a  particular 
class  of  interneuron  approximates  the  distribution  of  contrast  levels  found  in  natural  scenes  (Laughlin 
1981).  This  is  the  most  efficient  encoding  of  contrast  fluctuations,  meaning  that,  from  the  point  of  view 
of  information  theory,  these  cells  are  optimally  tuned  to  the  statistics  of  their  environment.  In  the 
context  of  the  primate  visual  system,  it  has  been  proposed  that  the  receptive  fields  of  various  cells  may 
have  the  form  they  do  for  similar  reasons.  Olshausen  and  Field  (Olshausen  and  Field  1996;  Olshausen 
and  Field  1997)  and  Bell  and  Sejnowski  (Bell  and  Sejnowski  1997)  have  demonstrated  that  the 
oriented “edge-finding” receptive fields that are found in early visual cortex (Hubel and Wiesel 1959)
may  exist  because  they  provide  an  encoding  of  natural  scenes  that  maximizes  information.  Olshausen 
and  Field  were  able  to  produce  such  filters  through  enforcing  “sparseness”  constraints  on  their 
encoding  while  ensuring  that  the  representation  allowed  for  high-fidelity  reconstruction  of  the  original 
scene.  Bell  and  Sejnowski  enforced  the  statistical  independence  of  the  filters  rather  than  working  with 
an  explicit  sparseness  criterion.  These  two  approaches  are  actually  equivalent,  as  demonstrated  by 
Olshausen  and  Field.  An  aspect  of  Bell  &  Sejnowski’s  work  that  sets  it  apart,  however,  is  their 
progression  through  constraints  of  different  strength,  such  as  PCA  (orthogonal  basis),  ZCA  (zero-phase 
whitening  filters)  and  finally  ICA  (statistical  independence).  These  different  constraints  lead  to 
qualitatively  different  filters,  such  as  checkerboard-like  structures  and  center-surround  functions, 
resembling  the  preferred  stimuli  of  cells  found  in  some  parts  of  the  visual  pathway  (V4  and  the  LGN, 
respectively). 

The  search  for  efficient  codes  has  helped  direct  the  efforts  of  researchers  interested  in 
explaining  neural  response  properties  in  the  visual  system,  and  fostered  the  study  of  ecological 
constraints  in  natural  scenes  (Simoncelli  and  Olshausen  2001).  However,  there  are  many  other  tasks 
that  the  visual  system  must  accomplish,  for  which  the  goal  may  be  quite  different  from  high-fidelity 
input reconstruction. The task of recognizing complex objects is an important case in point. A priori,
we  cannot  assume  that  the  same  computations  which  result  in  sparse  coding  would  also  support  robust 
recognition.  Indeed,  the  resilience  of  human  recognition  performance  to  image  degradations  suggests 
that  image  measurements  underlying  recognition  can  survive  significant  reductions  in  reconstruction 
quality.  Extracting  measurements  that  are  stable  against  ecologically  relevant  transformations  of  an 
object  (lighting,  pose,  etc.)  is  a  constraint  that  might  result  in  qualitatively  different  receptive  field 
structures  from  the  ones  that  support  high-fidelity  reconstruction. 

In  this  paper,  we  examine  the  nature  of  receptive  fields  that  emerge  under  a  recognition,  rather 
than  reconstruction,  based  criterion.  We  develop  and  illustrate  our  ideas  primarily  in  the  context  of 
human  faces,  although  we  expect  that  similar  analyses  can  be  conducted  with  other  object  classes  as 
well.  In  this  analysis,  we  note  the  emergence  of  a  novel  receptive  field  structure  that  we  call  the 
‘dissociated  dipole.’  These  dipoles  (or  ‘sticks’)  perform  simple  non-local  luminance  comparisons, 
allowing  for  a  region-based  representation  of  image  structure. 

We  also  compare  the  stability  characteristics  of  various  kinds  of  filters.  These  include  model 
neurons  with  receptive  field  structures  like  those  found  by  ‘sparse  coding’  constraints  and  ‘sticks’ 
operators.  Our  goal  is  to  eventually  gain  an  understanding  of  how  object  representations  that  are  useful 
for  recognition  might  be  constructed  from  simple  image  measurements. 

Experiment  1  -  Searching  for  Simple  Features  in  the  domain  of  Faces 

We  begin  by  investigating  what  kinds  of  simple  features  can  be  used  to  discriminate  between  frontally 
viewed  faces.  The  choice  of  a  specific  example  class  is  primarily  for  ease  of  exposition.  The  ideas  we 
develop  are  intended  to  be  more  generally  applicable.  (We  substantiate  this  claim  in  Experiment  2 
when  we  describe  computational  experiments  with  arbitrary  object  classes.) 

Computationally,  there  are  many  methods  for  performing  the  face  discrimination  task  with 
relatively  high  accuracy,  especially  if  the  faces  are  already  well-normalized  for  position,  pose,  and 
scale.  Using  nothing  more  than  the  Euclidean  distance  between  faces  to  do  nearest-neighbor 
classification  in  pixel  space,  one  can  obtain  reasonably  good  results  (~65%  with  a  40-person 
classification task using the ORL database, compiled by AT&T Laboratories, Cambridge, UK). Using
‘eigenfaces,’  one  can  improve  this  score  somewhat  by  removing  the  contribution  of  higher-order 
eigenvectors,  effectively  ‘de-noising’  the  face  space.  Further  adjustments  can  be  made  as  well, 
including  the  explicit  modeling  of  intra-  and  inter-personal  differences  (Moghaddam,  Jebara  et  al. 
2000)  and  the  use  of  more  complex  classifiers.  On  the  other  side  of  the  spectrum  from  these  global 
techniques,  there  are  also  methods  for  rating  facial  similarity  that  rely  on  Gabor  jets  placed  at  fiducial 
points  on  a  face  (Wiskott,  Fellous  et  al.  1997).  These  techniques  use  information  at  multiple  spatial 
scales  to  produce  a  representation  built  up  from  local  analyses,  and  are  also  quite  successful. 

The  overall  performance  of  these  systems  depends  both  on  the  choice  of  representation  and 
the  back-end  classification  strategy.  Since  we  focus  exclusively  on  the  former,  our  goal  is  not  to 
produce  a  system  for  recognition  that  is  superior  to  these  approaches,  but  rather  to  explore  the  space  of 
front-end feature choices. In other words, we look within a specific set of image measurements, bi-lobed
differential operators, to see what spatial analyses lead to the best invariance across images of
the  same  person.  For  our  purposes,  a  “bi-lobed  differential  operator”  is  a  feature  type  in  which 
weighted  luminance  is  first  calculated  over  two  image  regions  and  the  final  output  of  the  operator  is 
the  signed  difference  between  those  two  average  values.  In  general,  these  two  image  regions  need  not 
be  connected.  Some  examples  of  these  filters  are  shown  in  Figure  1. 


Figure  1  -  Some  examples  of  bi-lobed  differential  operators  of  the  sort  we  employ  in  Experiment  1. 
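
The definition of a bi-lobed differential operator given above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes uniformly weighted rectangular lobes, an image stored as a list of rows, and a hypothetical (top, left, height, width) rectangle convention. The names lobe_mean and bilobed_response are ours.

```python
def lobe_mean(image, rect):
    """Mean luminance inside rect = (top, left, height, width)."""
    top, left, height, width = rect
    total = sum(image[r][c]
                for r in range(top, top + height)
                for c in range(left, left + width))
    return total / (height * width)


def bilobed_response(image, positive_rect, negative_rect):
    """Signed difference between the mean luminance of two lobes.

    The two rectangles need not touch, so the same function covers
    both local (adjacent lobes) and non-local (separated lobes) operators.
    """
    return lobe_mean(image, positive_rect) - lobe_mean(image, negative_rect)


# Toy 4x6 image: dark on the left, bright on the right.
image = [[0, 0, 0, 10, 10, 10] for _ in range(4)]
edge_like = bilobed_response(image, (0, 3, 4, 1), (0, 2, 4, 1))    # adjacent lobes
dipole_like = bilobed_response(image, (0, 5, 2, 1), (2, 0, 2, 1))  # separated lobes
```

Both calls use the same function; only the placement of the two rectangles distinguishes an edge-like operator from a dissociated one.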

Conceptually,  the  design  of  our  experiment  is  as  follows:  We  exhaustively  consider  all 
possible  bi-lobed  differential  operators  (with  the  individual  lobes  modeled  as  rectangles  for  simplicity). 
We  evaluate  the  discrimination  performance  of  the  corresponding  measurements  over  a  face  database 
(discriminability  refers  to  maximizing  separation  between  individuals  and  minimizing  distances  within 
instances  of  the  same  person).  By  sorting  the  large  space  of  all  operators  using  the  criterion  of 
discriminability,  we  can  determine  which  are  likely  to  constitute  a  good  vocabulary  for  recognition. 

We  note  that  this  approach  differs  substantially  from  efforts  to  find  reliable  features  for  face 
and  object  detection  in  cluttered  backgrounds.  For  example,  Ullman’s  work  on  features  of 
“intermediate  complexity”  (Ullman,  Vidal-Naquet  et  al.  2002)  demonstrates  a  method  for  learning 
class-diagnostic  image  fragments  using  mutual  information.  These  IC  features  are  both  very  likely  to 
be  present  in  an  image  when  the  object  is  present  and  unlikely  to  appear  in  the  image  background  by 
chance.  Other  feature  learning  studies  have  concentrated  on  developing  generative  models  for  object 
recognition  (Fei-Fei,  Fergus  et  al.  2003;  Fergus,  Perona  et  al.  2003;  Fei-Fei,  Fergus  et  al.  2004)  in 
which  various  appearance  densities  are  estimated  for  diagnostic  image  fragments.  This  allows  for 
recognition  of  an  object  in  a  cluttered  scene  to  proceed  in  a  Bayesian  manner. 

These  studies  are  unquestionably  valuable  to  our  understanding  of  object  recognition.  Our 
goals  in  the  current  study  are  slightly  different,  however.  First  of  all,  we  are  interested  in  discovering 
what  features  support  invariance  to  a  particular  object  rather  than  a  particular  object  class.  It  is  for  this 
reason  that  we  do  not  attempt  to  segment  the  objects  under  consideration  from  a  cluttered  background. 
We  envision  segmentation  proceeding  via  parts-based  representations  such  as  those  described  above. 
While  it  may  be  possible  to  learn  diagnostic  features  of  an  individual  that  could  be  used  for 
segmentation  purposes,  we  believe  it  may  be  more  useful  to  consider  segmentation  as  a  process  that 
proceeds  prior  to  individuation.  Second,  rather  than  looking  for  complex  object  parts  that  support 
invariance  we  commence  by  considering  very  simple  features.  This  means  that  we  are  not  likely  to  find 
optimal  features  for  individuation.  Instead,  we  aim  to  determine  what  structural  properties  of 
potentially  low-level  RFs  contribute  to  recognition.  In  a  sense,  we  are  trying  to  understand  what 
computations between the lowest and highest levels of visual processing lead to the impressive
invariances  for  object  transformations. 

Given that we are attempting to understand how recognition abilities are built up from low-level
features, one might ask why we do not explicitly assume pre-processing by center-surround or
wavelet filters. Such an analysis could help us understand how the outputs of early visual areas (such
as the LGN and V1) serve as the basis for further computations that might support recognition. That
said,  we  have  chosen  not  to  adopt  this  strategy,  so  that  we  can  remain  completely  agnostic  as  to  what 
basic  computations  are  necessary  first  steps  towards  solving  high-level  problems. 

Stimuli 

We  use  faces  drawn  from  the  ORL  database  (Samaria  and  Harter  1994)  for  this  initial  experiment.  The 
images are all 112x92 pixels in size, and there are ten unique images of each of the 40 individuals
included  in  the  database.  We  chose  to  work  with  21  randomly  chosen  individuals  in  the  database,  using 
the  first  5  images  of  each  person.  The  faces  are  imaged  against  uniform  backdrops.  Therefore,  the  task 
in  our  experiment  is  not  to  segregate  faces  from  a  cluttered  background,  but  rather  to  individuate  them. 

Preprocessing 

Block-averaging - Relaxing locality constraints results in a very large number of allowable square
differential operators in a particular image. To reduce the size of our search space, we first down-sample
all of the images in our database to a much smaller size of 11x9 pixels. Much of the
information  necessary  for  successful  classification  is  present  at  this  small  size,  as  evidenced  by  the  fact 
that  the  recognition  performance  of  a  simple  nearest-neighbor  classifier  actually  increases  slightly 
(from  65%  correct  at  full-resolution  to  70%  using  8x8  pixel  ‘blocks’)  if  we  use  these  smaller  images  as 
input. 
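
The block-averaging step can be sketched as follows. The block size of 10 is an assumption, chosen because it maps a 112x92 input to the 11x9 size stated above once incomplete tiles are discarded; the 8x8 blocks also mentioned in the text would give 14x11 under the same procedure. The function name block_average and the synthetic input are illustrative only.

```python
def block_average(image, block):
    """Down-sample by averaging non-overlapping block x block tiles.

    Rows and columns that do not fill a complete tile are discarded.
    """
    rows = len(image) // block
    cols = len(image[0]) // block
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            tile_sum = sum(image[r * block + i][c * block + j]
                           for i in range(block)
                           for j in range(block))
            out_row.append(tile_sum / (block * block))
        out.append(out_row)
    return out


# A synthetic image with 112 rows and 92 columns standing in for an ORL face.
full = [[(r + c) % 256 for c in range(92)] for r in range(112)]
small = block_average(full, block=10)  # assumed block size, see note above
```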

Constructing  Difference  Vectors-  Our  next  step  involves  changing  our  recognition  problem  from  a 
21-class categorization task into a binary one. We do this by constructing difference vectors, which
will  comprise  two  classes  of  intra-  and  inter-personal  variation  (Moghaddam,  Jebara  et  al.  2000). 
Briefly,  we  subtract  one  image  from  another,  and  if  the  two  images  used  depicted  the  same  individual, 
then  that  difference  vector  captures  intra-personal  variation.  If,  on  the  other  hand,  the  two  images  were 
of  different  individuals,  then  that  difference  vector  would  be  one  that  captured  inter-personal  variation. 
Given  these  two  sets,  we  can  now  look  for  spatial  features  that  can  distinguish  between  these  two  types 
of  variation  in  facial  appearance,  rather  than  attempting  to  find  features  that  are  always  stable  within 
each  of  21  categories.  To  assemble  the  difference  vectors  used  in  this  experiment,  we  took  all  unique 
pair-wise  differences  between  images  that  depicted  the  same  person  (Intra-personal  set)  and  used  the 
first  image  of  each  individual  to  construct  a  set  of  pair-wise  differences  that  matched  our  first  set  in 
size  (Inter-personal  set). 
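
A minimal sketch of the difference-vector construction follows. It assumes that the inter-personal set consists of all unique pairs among the first images of each person; this is our reading of the description above, not a confirmed detail. Under that reading, 21 people with 5 images each yield 21 x 10 = 210 intra-personal pairs and C(21, 2) = 210 inter-personal pairs, which is consistent with the two sets matching in size. The placeholder images and function names are hypothetical.

```python
from itertools import combinations


def difference(a, b):
    """Element-wise difference of two equally sized images (lists of rows)."""
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def build_difference_sets(faces):
    """faces maps a person id to that person's list of images.

    Returns (intra, inter): intra holds all unique within-person pairs;
    inter holds all unique pairs of first images across people (assumed).
    """
    intra = [difference(a, b)
             for images in faces.values()
             for a, b in combinations(images, 2)]
    first_images = [images[0] for images in faces.values()]
    inter = [difference(a, b) for a, b in combinations(first_images, 2)]
    return intra, inter


# 21 people x 5 tiny placeholder images each (pixel values are arbitrary).
faces = {p: [[[p + k, p * k]] for k in range(5)] for p in range(21)}
intra, inter = build_difference_sets(faces)
```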

Constructing  ‘Integral  Images’  -  Finally,  now  that  we  have  two  sets  of  low-resolution  difference 
vectors,  we  introduce  one  last  pre-processing  step  designed  to  speed  up  the  execution  of  our  search. 
Since  the  differential  operators  we  are  analyzing  have  rectangular  lobes,  we  construct  ‘integral  images’ 
(Viola  and  Jones  2001)  from  each  of  our  difference  vectors.  Integral  images  allow  for  the  fast 
computation  of  rectangular  image  features,  reducing  the  process  to  a  series  of  look-ups.  The  value  of 
each  pixel  in  the  integral  image  created  from  a  given  stimulus  represents  the  sum  of  all  pixels  above 
and  to  the  left  of  that  pixel  in  the  original  picture. 
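
The integral-image look-up can be sketched as below. This version pads the table with an extra zero row and column, a common convention that avoids boundary checks; the padding and the function names are our assumptions and are not taken from the original implementation.

```python
def integral_image(image):
    """Summed-area table with an extra zero row and column.

    ii[r][c] holds the sum of all pixels in rows 0..r-1 and columns 0..c-1.
    """
    rows, cols = len(image), len(image[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += image[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle using four look-ups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
ii = integral_image(image)
lower_right = rect_sum(ii, 1, 1, 2, 2)  # sums 5 + 6 + 8 + 9
```

Dividing rect_sum by the rectangle's area gives the lobe mean, so each two-lobed response costs eight look-ups regardless of lobe size.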

Feature  Ranking 

In  our  11x9  images,  there  are  a  total  (n)  of  2970  unique  box  features.  Given  that  we  are  interested  in  all 
possible differential operators, there are approximately 4.5 million spatial features (n^2/2) for us to
consider.  To  decide  which  of  these  features  were  ‘best’  for  recognition,  we  used  A’  as  our  measure  of 
discriminability  (Green  and  Swets  1966).  A’  is  a  non-parametric  measure  of  discriminability  calculated 
by  finding  the  area  underneath  an  observer’s  ROC  (receiver-operating-characteristic)  curve.  This  curve 
is  determined  by  plotting  the  number  of  “hits”  and  “false  alarms”  a  given  observer  obtains  when  using 
a  particular  numerical  threshold  to  judge  the  presence  or  absence  of  a  signal. 
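
The feature counts quoted above can be checked with a short calculation. A rectangle on a pixel grid is fixed by choosing its top and bottom rows and its left and right columns, which gives 2970 boxes for an 11x9 image. Counting unordered pairs of distinct boxes, which is our assumption about how lobes are paired, gives about 4.4 million operators, of the same order as the approximate figure in the text.

```python
def count_rectangles(rows, cols):
    """Number of axis-aligned rectangles on a rows x cols pixel grid."""
    # Choose a (top, bottom) pair of rows and a (left, right) pair of columns.
    return (rows * (rows + 1) // 2) * (cols * (cols + 1) // 2)


n = count_rectangles(11, 9)          # single-rectangle ("box") features
unordered_pairs = n * (n - 1) // 2   # two-lobed operators, ignoring sign
```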

In  this  experiment,  we  treat  each  differential  operator  as  one  “observer.”  The  “signals”  we 
wish  to  detect  are  the  intra-personal  difference  vectors.  The  response  of  each  operator  (mean  value  of 
pixels under the white rectangle minus mean value of pixels under the black rectangle) was calculated
on  each  difference  vector,  and  then  the  labels  associated  with  those  vectors  (intra-  v.  inter-personal 
variation)  were  sorted  according  to  that  numerical  output.  With  the  distribution  of  labeled  difference 
vectors  in  hand  for  a  particular  feature,  we  could  proceed  to  calculate  the  value  of  A’.  We  determined 
how  many  hits  and  false-alarms  there  would  be  for  a  threshold  placed  at  each  possible  location  along 
the  continuum  of  observed  feature  values.  This  allowed  us  to  plot  a  discretized  ROC  curve  for  each 
feature.  Calculating  the  area  underneath  this  curve  is  straightforward,  yielding  the  discriminability  for 
that  operator.  A’  scores  range  from  0.5  to  1.  A  perfect  separation  of  intra-  and  inter-personal 
difference  vectors  would  lead  to  an  A’  score  of  1,  while  a  complete  enmeshing  of  the  two  classes 
would  lead  to  a  score  of  0.5. 
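
The A’ calculation described above can be sketched as a threshold sweep followed by a trapezoidal area estimate. Two details are assumptions on our part: a vector is called "intra-personal" when its response falls at or below the threshold, and the area is folded about 0.5 so that the score does not depend on which side of the threshold is labeled as signal, which matches the stated 0.5 to 1 range. The function name a_prime and the toy responses are hypothetical.

```python
def a_prime(intra_responses, inter_responses):
    """Area under a discretized ROC curve for detecting intra-personal vectors."""
    thresholds = sorted(set(intra_responses) | set(inter_responses))
    points = [(0.0, 0.0)]
    for t in thresholds:
        hits = sum(r <= t for r in intra_responses) / len(intra_responses)
        false_alarms = sum(r <= t for r in inter_responses) / len(inter_responses)
        points.append((false_alarms, hits))
    # Trapezoidal area under the (false-alarm rate, hit rate) curve.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    # Assumed folding step: keep the score in the range 0.5 to 1.
    return max(area, 1.0 - area)


# Toy operator responses on three intra- and three inter-personal vectors.
intra_raw = [-0.1, 0.2, -0.3]
inter_raw = [-0.7, 0.5, 0.6]
unrectified = a_prime(intra_raw, inter_raw)
rectified = a_prime([abs(r) for r in intra_raw], [abs(r) for r in inter_raw])
```

In this toy case the signed responses of the two classes overlap, while their magnitudes separate completely, so rectification changes the score an operator receives.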

In  one  simulation,  the  absolute  value  of  each  feature  was  taken  (rectified  results),  and  in 
another  the  original  responses  were  unaltered  (unrectified  results).  In  this  way,  we  could  establish  how 
instances  of  each  class  were  distributed  with  respect  to  each  spatial  feature,  both  with  and  without 
information  concerning  the  direction  of  brightness  differences. 

It  is  important  to  note  at  this  stage  that  there  is  no  reason  to  expect  that  any  of  the  values  we 
recover  from  our  analysis  of  these  spatial  features  will  be  particularly  high.  In  “boosting”  procedures, 
it  is  customary  to  use  a  cascade  of  relatively  poor  filters  to  construct  a  classifier  capable  of  robust 
performance,  meaning  that  even  with  a  collection  of  ‘bad’  features,  one  can  obtain  worthwhile  results. 
In  this  experiment,  we  are  only  interested  in  the  relative  ranking  of  features,  though  it  is  possible  that 
the  set  of  features  we  obtain  could  be  useful  for  recognition  despite  their  poor  abilities  in  isolation.  We 
shall  explicitly  consider  the  utility  of  the  features  discovered  here  in  a  recognition  paradigm  presented 
in  Experiment  2. 


Results 

Differential Operators - The top-ranked differential operators recovered from our analysis of the
space of possible two-lobed box filters are displayed in Figure 2. Both rectified and unrectified results
are  displayed.  As  we  expected,  the  A’  measured  for  each  individual  feature  is  not  particularly  high, 
with  the  best  operator  in  these  two  sets  scoring  approximately  0.71. 


[Figure 2 image: two 10x10 arrays of operators, titled "Unrectified Features" and "Rectified Features"]


Figure  2  -  The  top  100  ranked  features  for  discriminating  between  intra-  and  inter-personal  difference 
vectors.  Beneath  each  10x10  array  are  representatives  of  the  most  common  features  found  in  the  two 
arrays. 


There  are  four  main  classes  of  features  that  dominate  the  top  100  differential  operators.  First 
of  all,  features  resembling  center-surround  structures  appear  in  several  top  slots,  both  in  the  rectified 
and  unrectified  data.  This  is  somewhat  surprising,  given  that  cells  with  this  structure  are  most 
commonly  associated  with  very  early  visual  processing  implicated  in  low-level  tasks  such  as  contrast 
enhancement,  rather  than  higher-level  tasks  like  recognition.  Of  course,  the  features  we  have  recovered 
here  are  far  larger  in  terms  of  their  “receptive  field”  than  typical  center-surround  filters  used  for  early 
image processing, so perhaps these structures are useful for recognition if scaled up to larger sizes.

The second type of feature that is very prevalent in the results is what we will call a
“dissociated  dipole”  or  “stick”  operator,  and  appears  primarily  in  the  unrectified  results.  These  features 
have  a  spatially  disjoint  structure,  meaning  that  they  execute  brightness  comparisons  across  widely 
separate  parts  of  an  image.  Admittedly,  the  connection  between  these  operators  and  the  known 
physiology  of  the  primate  visual  system  is  weak.  To  date,  there  have  been  no  cells  with  this  sort  of 
dissociated  receptive-field  structure  found  in  the  human  visual  pathway,  although  they  may  exist  in  the 
auditory  and  somatosensory  processing  streams  (Young  1984;  Chapin  1986). 

The  final  two  features  are  elongated  edge  and  line  detectors,  which  dominate  the  results  of  the 
rectified  operators.  An  elongated  edge  detector  appears  in  the  unrectified  rankings  as  well,  but  other 
structurally  similar  features  are  found  only  in  the  next  100  ranked  features.  These  structures  resemble 
some  of  the  receptive  fields  known  to  exist  in  striate  cortex,  as  well  as  the  wavelet-like  operators  that 
support  sparse  coding  of  natural  scenes. 

We  point  out  that  multiple  ‘copies’  of  these  features  appear  throughout  our  rankings,  which  is 
to  be  expected.  Small  structural  changes  to  these  filters  only  slightly  alter  their  A’  score,  meaning  that 
many  of  the  top  features  have  very  similar  forms.  We  do  not  attribute  any  particular  importance  to  the 
fact  that  the  non-local  operators  that  perform  best  appear  to  be  comparing  values  on  the  right  edge  of 
the  image  to  values  in  the  center,  nor  to  the  tendency  for  elongated  edge  detectors  to  appear  in  the 
center  of  the  image.  It  is  only  the  generic  structure  of  each  operator  that  is  important  to  us  here. 

Single  Rectangle  Features  -  We  chose  to  examine  differential  operators  in  our  initial  analysis  for 
several  reasons.  First  of  all,  cells  with  both  excitatory  and  inhibitory  regions  are  found  throughout  the 
visual system. Second, by taking the difference in luminance between one region and another, one is far
less  sensitive  to  uniform  changes  in  illumination  brought  on  by  haze,  bright  lighting,  etc.  However, 
given  that  we  are  using  a  database  of  faces  that  is  already  relatively  well  controlled  in  terms  of  lighting 
and  pose,  it  may  be  the  case  that  even  simpler  features  can  support  recognition.  To  examine  this 
possibility,  we  conduct  the  same  analysis  described  above  for  differential  operators  on  the  set  of  all 
single-rectangle  box  features  in  our  images. 

We  find  that  single-rectangle  features  are  not  as  useful  for  discriminating  between  our  two 
classes  as  are  differential  operators.  The  range  of  A’  values  for  the  top  100  features  from  each  category 
is plotted in Figure 3, where it is clear that both sets of differential operators provide better
recognition  performance  than  single  box-filters.  Even  in  circumstances  where  many  of  the  reasons  to 
employ  differential  operators  have  been  removed  through  clever  database  construction  (say,  by 
disallowing  fluctuations  in  ambient  illumination),  we  find  that  they  still  out-perform  simpler 
measurements. 


Figure 3 - Plots of A’ scores across the best features from each family of operators (single v.
double rectangle features, as well as rectified v. unrectified operator values).

Discussion

In  our  analysis  of  the  best  differential  operators  for  face  recognition,  we  have  observed  a  new  type  of 
operator  (the  dissociated  dipole)  that  offers  an  alternative  form  of  processing  by  which  within-class 
stability  might  be  achieved  for  images  of  faces.  An  important  question  to  consider  is  how  this  operator 
fits  within  the  framework  of  previous  computational  models  of  recognition,  as  well  as  whether  or  not  it 
has  any  relevance  to  human  vision. 

The  dissociated  dipole  is  an  instance  of  a  higher-order  image  statistic,  namely  a  binary 
measurement. The notion that such statistics might be useful for pattern recognition is not new; indeed,
Julesz (Julesz 1975) suggested that ‘needle statistics’ could be useful for characterizing random-dot
textures.  In  the  computer  vision  community,  non-local  comparisons  are  employed  in  integral  geometry 
to  characterize  shapes  (Novikoff  1962).  The  possibility  that  non-local  luminance  comparisons  may  be 
useful  for  object  and  face  recognition  has  not  been  thoroughly  explored,  however.  Such  an  approach 
differs  from  traditional  shape-based  approaches  to  object  recognition,  in  that  it  implicitly  considers 
relationships  between  regions  to  be  of  paramount  importance.  Our  recent  results  (Balas  and  Sinha 
2003)  have  demonstrated  that  such  a  non-local  representation  of  faces  provides  for  better  recognition 
performance  than  a  strictly  local  one.  Furthermore,  Kouh  &  Riesenhuber  (Kouh  and  Riesenhuber 
2003)  have  found  that  to  model  the  responses  of  V4  neurons  to  various  gratings  using  the  HMAX 
model  of  recognition  (Riesenhuber  and  Poggio  1999)  it  is  necessary  to  pool  responses  from  spatially 
disjoint  low-level  neurons. 

Before proceeding, we wish to specify more precisely the relationship between local, non-local,
and global image analysis. We consider local analyses those in which a contiguous set of pixels
(either  4  or  8-connected)  are  represented  in  terms  of  a  single  output  value.  A  global  analysis  is  similar 
to  this,  save  for  the  amount  of  the  image  under  consideration.  In  the  limit,  a  global  image  analysis  uses 
all  pixels  in  the  image  to  construct  the  output  value.  A  local  analysis  might  only  use  some  small 
percentage  of  image  area.  This  distinction  is  not  truly  categorical.  Rather,  there  is  a  spectrum  between 
local  and  global  image  analysis. 

Likewise,  a  similar  spectrum  exists  between  local  and  non-local  analysis.  While  a  local 
analysis  only  considers  a  set  of  contiguous  pixels,  a  non-local  analysis  breaks  this  condition  of 
contiguity.  In  the  extreme,  one  can  imagine  a  highly  non-local  feature  composed  of  two  pixels  located 
at  opposite  corners  of  an  image.  At  the  other  extreme  would  be  a  highly  local  feature  consisting  of  two 
neighboring  pixels.  Of  course,  there  are  many  operators  spanning  these  two  possibilities  that  are 
neither purely local nor non-local. Moreover, if one measures local features (like Gabor filter outputs) at
several  non-overlapping  positions,  is  this  a  local  or  a  non-local  analysis?  If  one  is  merely  concatenating 
the  values  of  each  local  analysis  into  one  feature  vector,  then  this  is  not  a  truly  non-local  computation 
by  our  definition.  If  however  the  values  of  those  local  features  are  explicitly  combined  to  produce  one 
output  value,  then  we  would  have  arrived  at  a  non-local  analysis  of  the  image.  Non-local  analysis  of 
this  type  has  traditionally  received  less  attention  than  local  or  global  strategies  of  image  processing. 

The  reason  non-local  representations  of  brightness  have  not  been  studied  in  great  detail  may 
be  due  to  the  sheer  number  of  generic  binary  statistics.  In  general,  the  trouble  with  appeals  to  higher- 
order  statistics  for  recognition  is  that  there  is  a  vast  space  of  possible  measurements  that  are  allowable 
with  the  introduction  of  new  parameters  (in  our  case,  the  distance  between  operator  lobes).  This 
combinatorial  explosion  makes  it  hard  to  determine  which  particular  measurements  are  actually  useful 
within  the  large  range  of  possibilities.  This  is,  of  course,  a  serious  problem  in  that  the  utility  of  any  set 
of  proposed  measurements  is  dependent  on  the  ability  to  separate  helpful  features  from  useless  ones. 

We  also  note  that  there  are  several  computational  oddities  associated  with  non-local  operators. 
Suppose  that  we  formulate  a  “dissociated  dipole”  as  a  difference-of-offset-Gaussians  operator  (a  model 
we  present  in  full  in  the  next  experiment),  allowing  the  distance  between  the  two  Gaussians  to  be 
manipulated independently of either one’s spatial constant (Figure 4). In so doing, we lose the ability to
create  ‘steerable’  filters  (Freeman  and  Adelson  1991),  meaning  that  to  obtain  dipoles  at  a  range  of 
orientations  we  have  no  other  option  than  to  use  a  large  number  of  operators.  This  is  not  impossible, 
but  lacks  the  elegance  and  efficiency  of  more  traditional  approaches  by  which  multi-scale 
representations can be created at any orientation through the use of a small number of basis functions.

Figure 4 - A dipole measurement is parameterized in terms of the space constant σ of each lobe, the
distance δ between the centers of each lobe, and the angle of orientation, θ.

Another  important  difference  between  local  and  non-local  computations  is  the  distribution  of 
operator  outputs.  Natural  images  are  spatially  redundant,  meaning  that  the  output  of  most  local 
operators  is  near  zero  (Kersten  1987).  The  result  is  a  highly  kurtotic  distribution  of  filter  outputs, 
indicating  that  a  sparse  representation  of  the  image  using  those  filters  is  expected.  In  many  cases,  this 
is  highly  desirable,  both  from  metabolic  and  computational  viewpoints.  As  we  increase  the  distance 
between  the  offset  Gaussians  we  use  to  model  dissociated  dipoles,  the  kurtosis  of  the  distribution 
decreases  significantly.  This  means  that  using  these  operators  yields  a  coarse  (or  “distributed”) 
encoding  of  the  image  under  consideration.  This  may  not  be  unreasonable,  especially  given  that 
distributed  representations  of  complex  objects  may  help  increase  robustness  to  image  degradation. 
However,  it  is  important  to  note  that  non-local  computations  depart  from  some  conventional  ideas 
about  image  representation  in  significant  ways. 

In  our  next  experiment,  we  shall  directly  address  the  question  of  whether  or  not  the  structures 
we  have  discovered  in  this  analysis  are  useful  for  face  and  object  classification.  In  this  next  analysis, 
we  remove  many  of  the  simplifications  necessary  for  an  exhaustive  search  to  be  tractable  in 
Experiment  1.  We  also  move  beyond  the  domain  of  face  recognition  to  include  multiple  object  classes 
in  our  recognition  task. 

Experiment 2 - Face and object recognition using local and non-local features

In  our  first  experiment,  we  noted  the  emergence  of  center-surround  operators  and  non-local  operators 
under  a  recognition  criterion  for  frontally-viewed  faces.  However,  in  our  first  experiment  many 
compromises  were  made  in  order  to  conduct  an  exhaustive  search  through  the  space  of  possible 
operators.  First,  our  images  were  reduced  to  an  extremely  small  size  in  order  to  limit  the  number  of 
features  we  needed  to  consider.  Though  faces  can  be  recognized  at  very  low  resolutions,  it  is  also  clear 
that  there  is  interesting  and  useful  structure  at  small  spatial  scales.  Second,  we  chose  to  work  with 
difference  images  rather  than  the  original  faces.  This  allowed  us  to  transform  a  multi-category 
classification  task  into  a  binary  task,  yet  makes  the  implicit  assumption  that  a  differencing  operation 
occurs  as  part  of  the  recognition  process.  Third,  we  point  out  that  in  any  consideration  of  all  possible 
bi-lobed  features  in  an  image  the  number  of  non-local  features  will  far  exceed  the  number  of  local 
features.  Greater  numbers  need  not  imply  better  performance,  yet  it  is  still  troubling  to  consider  that 
the  abundance  of  useful  non-local  operators  may  be  a  function  of  set  size.  Finally,  we  note  that  in  only 
considering  face  images,  it  is  unclear  whether  the  features  we  discovered  are  useful  for  general 
recognition  purposes  or  specific  to  face  matching. 

In  this  second  experiment,  we  attempt  to  address  these  concerns  through  a  recognition  task 
that  eliminates  many  of  these  difficulties.  We  employ  full-resolution  images  of  both  faces  and  various 
complex  objects  in  a  classification  task  designed  to  test  the  efficacy  of  center-surround,  local-oriented, 
and  non-local  features  in  an  unbiased  fashion. 

Stimuli

For  our  face  recognition  experiment,  we  once  again  make  use  of  the  ORL  database.  In  this  case,  all  40 
individuals  were  used,  with  one  image  of  each  person  serving  as  a  training  image.  The  images  were  not 
pre-processed  in  any  way,  and  remained  at  full  resolution  (112x92  pixels). 

To  help  determine  if  our  findings  hold  up  across  a  range  of  object  categories,  we  also  conduct 
this  recognition  experiment  with  images  taken  from  the  COIL  database  (Nayar  et  al  1996;  Nene  et  al 
1996).  These  images  are  128x128  pixel  images  of  100  different  objects,  including  toy  cars,  foods, 
pharmaceutical  products,  and  many  other  diverse  items.  We  selected  these  images  for  the  wide  range 
of  surface  and  structural  properties  represented  by  the  objects.  Also,  repeated  exemplars  of  a  few 
object  categories  (such  as  cars)  make  both  across-class  and  within-class  recognition  necessary.  Each 
object  is  depicted  rotated  in  depth  from  its  original  position  in  increments  of  5  degrees.  We  chose  the 
0-degree  images  of  each  object  as  training  images,  and  used  the  following  9  images  as  test  images.  The 
only  pre-processing  performed  on  these  images  was  reducing  them  from  full-color  to  grayscale. 

[Figure 5 panels: face recognition stimuli (top row); object recognition stimuli (bottom row)]

Figure 5 - Examples of stimuli used in Experiment 2. The top row contains training images of several
individuals  depicted  in  the  ORL  database.  The  bottom  row  contains  training  images  of  objects  depicted 
in  the  COIL  database.  Note  that  the  COIL  database  contains  multiple  exemplars  of  some  object  classes 
(such  as  the  cars  in  this  figure)  making  within-class  discrimination  a  necessary  part  of  performing 
recognition  well  using  this  database. 

Procedure 

To  determine  the  relative  performance  of  center-surround,  local-oriented,  and  non-local  features  in  an 
unbiased  way,  we  model  all  of  our  features  as  generalized  difference-of-gaussian  operators.  A  generic 
bi-lobed  operator  in  two-dimensional  space  can  be  modeled  as  follows: 

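$$\frac{1}{\sqrt{2\pi}\,|\Sigma_1|^{1/2}}\, e^{-\frac{(\mathbf{x}-\mu_1)^{T}\Sigma_1^{-1}(\mathbf{x}-\mu_1)}{2}} \;-\; \frac{1}{\sqrt{2\pi}\,|\Sigma_2|^{1/2}}\, e^{-\frac{(\mathbf{x}-\mu_2)^{T}\Sigma_2^{-1}(\mathbf{x}-\mu_2)}{2}} \qquad (1)$$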

For  all  of  our  remaining  experiments,  we  shall  only  consider  operators  with  diagonal  covariance 
matrices Σ1 and Σ2. Further, the diagonal elements of each matrix Σ shall be equal, yielding isotropic
Gaussian  lobes.  For  this  simplified  case,  the  above  equation  can  be  expressed  thusly: 
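
As a sketch of this simplification (a reconstruction from the stated assumptions, not a reproduction of the typeset equation), write the operator as G(x), a name introduced here for convenience. Taking each covariance matrix to be σ_i² times the identity gives |Σ_i|^{1/2} = σ_i² in two dimensions, and the quadratic form in each exponent becomes a squared Euclidean distance scaled by σ_i²:

```latex
\Sigma_i = \sigma_i^{2} I
\;\Longrightarrow\;
(\mathbf{x}-\mu_i)^{T}\Sigma_i^{-1}(\mathbf{x}-\mu_i) = \frac{\lVert \mathbf{x}-\mu_i \rVert^{2}}{\sigma_i^{2}},
\qquad
|\Sigma_i|^{1/2} = \sigma_i^{2}

G(\mathbf{x}) = c_1\, e^{-\frac{\lVert \mathbf{x}-\mu_1 \rVert^{2}}{2\sigma_1^{2}}}
              - c_2\, e^{-\frac{\lVert \mathbf{x}-\mu_2 \rVert^{2}}{2\sigma_2^{2}}},
\qquad
c_i = \frac{1}{\sqrt{2\pi}\,\sigma_i^{2}}
```

Each lobe is then fully described by its center μ_i and a single space constant σ_i.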

We introduce also a parameter δ to represent the separation between the two lobes. This is of course
simply the Euclidean norm of the difference between the two means.

$$\delta = \lVert \mu_2 - \mu_1 \rVert \qquad (3)$$

In order to build a center-surround operator, δ must be set to zero, and the spatial constants of the
center and surround should be in a ratio of 1 to 1.6 to match the dimensions of RFs found in the human
visual system (Marr 1982). To create a local-oriented operator, we shall set σ1 = σ2, and set the
distance δ to be equal to 3 times the value of the spatial constant. Finally, non-local operators can be
created by allowing the distance δ to exceed the value 3σ (once again assuming equal spatial constants
for the two lobes). Examples of all of these operators are displayed in Figure 6.
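
The parameterization above can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used in the experiments: the function names are hypothetical, the operator is placed by the midpoint between its lobes, and each lobe is scaled to unit volume (an assumption made here so that a uniform image yields zero response). For center-surround operators the sketch treats the given space constant as the center's, which is also an assumption.

```python
import math

def gaussian_lobe(x, y, cx, cy, sigma):
    """Isotropic 2-D Gaussian centred at (cx, cy), scaled to unit volume (assumption)."""
    r2 = (x - cx) ** 2 + (y - cy) ** 2
    return math.exp(-r2 / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)

def dipole_response(image, cx, cy, sigma1, sigma2, delta, theta):
    """Response of a two-lobed difference-of-Gaussians operator to `image`.

    image  : list of rows of grayscale values
    cx, cy : midpoint between the two lobe centres, in pixels
    delta  : distance between lobe centres; theta: orientation in radians
    """
    dx = 0.5 * delta * math.cos(theta)
    dy = 0.5 * delta * math.sin(theta)
    x1, y1 = cx - dx, cy - dy   # centre of the positive lobe
    x2, y2 = cx + dx, cy + dy   # centre of the negative lobe
    total = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            weight = (gaussian_lobe(x, y, x1, y1, sigma1)
                      - gaussian_lobe(x, y, x2, y2, sigma2))
            total += weight * value
    return total

def family_params(family, sigma):
    """Return (sigma1, sigma2, delta) for the operator families named in the text."""
    if family == "center-surround":
        return sigma, 1.6 * sigma, 0.0          # 1 : 1.6 centre-to-surround ratio
    multiples = {"local": 3.0, "non-local-6": 6.0, "non-local-9": 9.0}
    return sigma, sigma, multiples[family] * sigma
```

For example, `dipole_response(img, 20, 20, *family_params("local", 2.0), 0.0)` evaluates a horizontally oriented local operator centered at pixel (20, 20).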


Figure 6 - Representative operators drawn from the four operator families considered in Experiment 2.
Top-to-bottom, we display examples of center-surround features, local oriented features, and two kinds
of non-local features (δ = 6σ, δ = 9σ).

Given  this  simple  parameterization  of  our  three  feature  types,  we  choose  in  this  experiment  to 
sample  equal  numbers  of  each  kind  of  operator  from  the  full  set  of  possible  features.  In  this  way,  we 
may  represent  each  of  our  training  images  in  terms  of  some  small  number  of  features  drawn  from  a 
specific  operator  family  and  evaluate  subsequent  classification  performance. 

Four operator families were considered: center-surround features (δ = 0), local-oriented
features (δ = 3σ), and two kinds of non-local features (δ = 6σ and 9σ). For each operator family, we
constructed  40  banks  of  50  randomly  positioned  and  oriented  operators  each.  20  of  these  feature  banks 
contained  operators  with  a  spatial  constant  of  2  pixels,  and  the  other  20  feature  banks  contained 
operators  with  a  4  pixel  spatial  constant.  Each  bank  of  operators  was  applied  to  the  training  images  to 
generate  a  feature  vector  consisting  of  50  values.  The  same  operators  were  then  applied  to  all  test 
images,  and  the  resulting  feature  vectors  were  classified  using  a  nearest-neighbor  metric  (L2  norm). 
This  procedure  was  carried  out  on  both  the  ORL  database  and  the  COIL  database. 
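
The bank-and-classify procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the reported results: the function names are hypothetical, and the placement margin (half the lobe separation plus three space constants) is an assumption, since the text does not say how operator positions were constrained near the image border.

```python
import math
import random

def sample_bank(n_ops, width, height, sigma, delta, rng):
    """Draw n_ops random (cx, cy, theta) placements for one operator family."""
    # Assumption: keep both lobes roughly 3 sigma away from the image border.
    margin = 0.5 * delta + 3.0 * sigma
    bank = []
    for _ in range(n_ops):
        cx = rng.uniform(margin, width - 1 - margin)
        cy = rng.uniform(margin, height - 1 - margin)
        theta = rng.uniform(0.0, 2.0 * math.pi)
        bank.append((cx, cy, theta))
    return bank

def nearest_neighbor(test_vec, train_vecs):
    """Index of the training vector closest to test_vec under the L2 norm."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(train_vecs)), key=lambda i: sq_dist(test_vec, train_vecs[i]))

def accuracy(test_vecs, test_labels, train_vecs, train_labels):
    """Fraction of test vectors whose nearest training vector carries the right label."""
    hits = sum(train_labels[nearest_neighbor(vec, train_vecs)] == label
               for vec, label in zip(test_vecs, test_labels))
    return hits / len(test_vecs)
```

With a 92-pixel-wide image, σ = 4 and δ = 9σ, this margin leaves only about a third of the image width available for operator midpoints, which illustrates how large separations restrict placement.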

Results 

The  number  of  images  correctly  identified  for  a  given  filter  bank  was  calculated  for  each  recognition 
trial,  allowing  us  to  compute  an  average  level  of  classification  performance  from  the  20  runs  within 
each operator family and spatial scale. We find in this task that once again center-surround and non-local
features offer the best recognition performance. This result holds at both spatial scales used in
this  task,  as  well  as  for  both  face  recognition  and  multi-class  object  recognition.  We  also  note  the  small 
variability  in  recognition  performance  around  each  operator’s  mean  value.  The  random  sampling  of 
features  to  fill  up  our  operator  banks  led  to  very  consistent  recognition  performance. 

[Figure 7 panels: face recognition performance vs. lobe size and separation (left); object recognition performance vs. lobe size and separation (right). x-axis in both panels: distance between lobes in multiples of spatial constant.]

Figure  7  -  Recognition  performance  for  both  faces  (left)  and  objects  (right)  as  a  function  of  both  the 
distance  between  operator  lobes  and  the  spatial  constant  of  the  lobes. 

In  both  cases,  we  note  that  center-surround  performance  slightly  exceeds  that  obtained  using 
non-local operators. It is interesting to note, however, that with a larger separation between the lobes of
a  non-local  feature  comes  better  recognition  performance.  This  cannot  continue  indefinitely,  of  course, 
as  longer  and  longer  distances  will  lead  to  more  limitations  on  where  operators  can  be  placed  within 
the  image.  Increased  accuracy  with  increased  non-locality  does  suggest  that  larger  distances  between 
lobes  are  more  useful,  however,  and  that  it  is  not  enough  to  simply  deviate  from  locality. 

We  note  that  the  distinct  dip  in  performance  for  local-oriented  features  is  both  consistent  and 
puzzling.  Why  should  it  be  the  case  that  un-oriented  local  features  are  good  at  recognition  while 
oriented  local  features  are  poor?  Center-surround  operators  analyze  almost  the  same  pixels  as  a  local- 
oriented  operator  placed  at  the  same  location,  so  why  should  they  be  so  different  in  terms  of  their 
recognition  performance?  Moreover,  how  is  it  that  radically  different  operators  like  the  dissociated 
dipole  and  the  center-surround  operator  should  perform  so  similarly?  In  our  third  and  final  experiment, 
we  attempt  to  address  these  questions  by  breaking  down  the  recognition  problem  into  distinct  parts  so 
we  can  learn  how  these  operator  families  function  in  classification  tasks. 

Experiment  3 

In  Experiment  2,  we  determined  that  both  center-surround  and  non-local  operators  outperform  local 
oriented  features  at  recognition  of  faces  and  objects.  In  many  ways,  this  is  quite  surprising.  Center- 
surround  features  appear  to  share  little  with  non-local  operators  as  we  have  defined  them,  yet  their 
recognition  performance  is  quite  similar.  In  this  final  Experiment,  we  break  down  the  process  of 
recognition  into  two  distinct  components  to  determine  if  these  two  receptive  field  structures  succeed  at 
recognition  by  possessing  different  properties.  Along  the  way,  we  also  hope  to  discover  why  local 
oriented  features  are  so  poor  at  recognition  despite  their  strong  resemblance  to  center-surround 
features. 

In  this  task  we  shall  break  down  the  recognition  process  into  components  of  stability  and 
variability.  To  perform  well  at  recognition,  a  particular  operator  must  first  be  able  to  respond  in  much 
the  same  way  to  many  different  images  of  the  same  face.  This  is  how  we  define  stability,  and  one  can 
think  of  it  in  terms  of  various  identity-preserving  transformations.  Whether  a  face  is  smiling  or  not,  lit 
from  the  side  or  not,  a  useful  operator  for  recognition  must  not  vary  its  response  too  widely.  If  this 
proves  true,  we  may  say  that  that  feature  is  invariant  to  the  transformation  being  considered. 

We use this notion to formulate an operational definition of stability in terms of a
set of image measurements and a particular face transformation. Let us first imagine that we possess a
set  of  image  measurements  in  a  filter  bank,  just  as  we  did  in  Experiment  2.  This  filter  bank  is  applied 
to  some  initial  image,  which  shall  always  depict  a  person  in  frontal  view  with  a  neutral  expression.  The 
value  of  each  operator  in  our  collection  can  be  determined  and  stored  in  a  one-dimensional  vector,  x. 
This same set of operators is then applied to a second image, depicting the same person as the original
image,  but  with  some  change  of  expression  or  pose.  The  values  resulting  from  applying  all  operators  to 
this  new  image  are  then  stored  in  a  second  vector,  y.  The  two  vectors  x  and  y  may  then  be  compared  to 
see  how  drastic  the  changes  in  operator  response  were  across  the  transformation  from  the  first  image  to 
the  second.  If  by  some  luck  our  operators  are  perfectly  invariant  to  the  current  transformation,  plotting 
x  vs.  y  would  produce  a  scatterplot  in  which  all  points  would  lie  on  the  line  y=x.  Poor  invariance  would 
be  reflected  in  a  plot  in  which  points  are  distributed  in  an  isotropic  cluster.  For  two  vectors  x  and  y 
(each  of  length  n),  we  may  use  the  value  of  the  correlation  coefficient  (see  Equation  4)  between  them 
as  our  quantitative  measure  of  feature  stability. 


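$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}} \qquad (4)$$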
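
The stability measure can be computed directly from the raw sums in the definition of the correlation coefficient. The sketch below is illustrative; the function name is hypothetical, and x and y stand for the operator-response vectors before and after a transformation.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length response vectors, from raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator
```

A bank that is perfectly invariant to a transformation gives y = x and therefore a coefficient of 1, while unrelated responses give values near 0. The function is undefined when either vector is constant, since the denominator is then zero.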


The second component of recognition is variability. It is not enough to be stable to
transformations; one must also be diagnostic of identity. Imagine, for example, that one finds an image
measurement  that  is  perfectly  stable  across  lighting,  expression,  and  pose  transformations.  It  may  seem 
that  this  measurement  is  ideal  for  recognition,  but  let  us  also  imagine  that  it  turns  out  to  be  of  the  same 
value  for  every  face  which  you  consider!  Clearly  you  have  no  means  of  distinguishing  one  face  from 
another  using  this  measurement,  despite  its  remarkable  invariance  to  transformations  of  a  single  face. 
What  is  needed  then,  is  an  ability  to  be  stable  within  images  of  a  single  face,  but  vary  broadly  across 
images  of  many  different  faces.  This  last  attribute  we  shall  call  variability,  and  we  may  quantify  it  for  a 
particular  measurement  as  the  variance  of  its  response  across  a  population  of  faces. 
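
Variability as defined here can be sketched in the same spirit. The function names are hypothetical, and two choices are assumptions rather than details given in the text: the sample (n − 1) variance is used, and a bank is summarized by the mean of its per-operator variances together with the standard error of that mean.

```python
import math
import statistics

def operator_variances(responses):
    """responses[f][k] is the output of operator k on face f.

    Returns one variance per operator, taken across the population of faces.
    """
    n_ops = len(responses[0])
    return [statistics.variance([face[k] for face in responses])
            for k in range(n_ops)]

def summarize(variances):
    """Mean of the per-operator variances and the standard error of that mean."""
    mean = statistics.mean(variances)
    std_err = statistics.stdev(variances) / math.sqrt(len(variances))
    return mean, std_err
```

An operator that returns the same value for every face has zero variance under this measure, however stable it may be across transformations of a single face.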

In  this  third  experiment,  we  shall  use  these  operational  definitions  of  stability  and  variability 
to determine what properties center-surround and non-local operators possess that make them useful
for  recognition.  We  shall  return  once  again  to  the  domain  of  faces,  as  they  provide  a  rich  set  of 
transformations  to  consider,  both  rigid  and  non-rigid  alterations  of  the  face  in  varying  degree. 

Stimuli 

We  use  16  faces  (8  men,  8  women)  from  the  Stirling  face  database  for  this  experiment.  The  faces  are 
grayscale  images  of  individuals  in  a  neutral,  frontal  pose  accompanied  by  pictures  of  the  same  models 
smiling  and  speaking  while  facing  forward,  and  also  in  a  three-quarter  pose  with  neutral  expression. 

We  call  these  transformations  the  SMILE,  SPEECH,  and  VIEW  transforms  respectively.  The  original 
images  were  284x365  pixels,  and  the  only  pre-processing  step  applied  was  to  crop  out  a  256x256  pixel 
region  centered  in  the  original  image  rectangle. 

Procedure 

All  operators  in  these  sets  were  built  as  difference-of-Gaussian  features,  exactly  as  described  in 
Experiment  2.  Also  as  before,  center-surround,  local  oriented,  and  two  kinds  of  non-local  features  were 
evaluated.  Three  ‘scales’  were  employed  for  each  kind  of  feature,  corresponding  to  a  space  constant  of 
4  pixels  (fine  scale),  8  pixels  (medium  scale),  and  16  pixels  (coarse  scale).  In  the  case  of  center- 
surround  features,  the  value  of  the  space  constant  always  refers  to  the  size  of  the  surround.  For  each 
pair  of  images  to  be  analyzed,  we  construct  a  total  of  120  collections  of  50  operators  each.  These 
feature  banks  were  split  into  10  center-surround,  10  local,  and  20  non-local  banks  (10  banks  each  for 
separations  of  6  and  9  times  the  spatial  constant  of  the  lobes)  at  each  of  the  3  scales  mentioned  above. 

Once  a  set  of  operators  was  constructed,  we  applied  it  to  each  neutral,  frontal  image  in  our 
data set to assemble the feature values for the starting image. The same operators were then applied to
each  of  the  three  transformed  images  so  that  a  value  for  Pearson’s  R  could  be  calculated  for  that  set  of 
operators  relative  to  each  transformation.  The  average  value  of  Pearson’s  R  could  then  be  taken  across 
all  16  faces  in  our  set.  This  process  was  repeated  for  all  families  and  scales  of  operator  banks  to  assess 
stability. 

To assess variability, operator banks were once again applied to the neutral, frontal images.
This time, the variance in each operator’s output was calculated across the population of 16
faces.  The  results  were  combined  and  expressed  in  terms  of  the  mean  variance  of  response  and  its 
standard  deviation. 

Results

Difference-of-Gaussian  features 

Plots  depicting  the  average  values  of  the  correlation  coefficients  (averaged  again  over  all  individuals) 
are  presented  below  (Figure  8).  We  present  the  measured  stability  of  each  kind  of  operator  across  three 
ecologically relevant transformations: SMILE (2nd image of individuals smiling), SPEECH (2nd image
of individuals speaking), and VIEW (2nd image of individuals in three-quarter pose).

These  plots  highlight  several  interesting  characteristics  of  our  operators.  First,  center-surround 
filters  at  each  of  our  3  scales  appear  to  perform  quite  well  compared  to  the  other  features  once  again. 

As  soon  as  we  move  the  two  Gaussians  apart  to  form  oriented  local  operators,  however,  a  sharp  dip  in 
stability  occurs  at  the  medium  and  fine  scales.  This  indicates  that  the  two-lobed  oriented  edge  detectors 
used  here  provide  for  comparatively  poor  stability  across  all  three  of  the  transformations  we  have 
examined  here.  That  said,  as  the  distance  between  the  lobes  of  our  operators  increases  further,  stability 
of  response  also  increases.  Non-locality  seems  to  increase  stability  across  all  three  transformations, 
nearly  reaching  the  level  of  center-surround  stability  at  both  medium  and  coarse  scales. 

Figure 8 - The stability of each feature type (x-axis) as a function of both the spatial scale of the
Gaussian  lobes,  and  various  facial  transformations. 

Stability,  however,  is  not  the  only  attribute  required  to  perform  recognition  tasks  well.  A 
feature  that  is  stable  across  face  transformations  is  only  useful  if  it  is  not  also  stable  across  images  of 
different  individuals.  That  is,  a  universal  feature  is  not  of  any  use  for  recognition  because  it  has  no 
discriminative  power.  We  present  next  the  amount  of  variability  in  response  for  each  family  of 
operators  (Table  1). 


Table 1 - Mean±S.E. of operator variance across individuals

                      σ = 4         σ = 8         σ = 16
Center-Surround       122.5±3.7     206.6±6.2     311.3±8.5
Local (δ = 3σ)        242.0±9.6     527.0±15.0    986.9±26.7
Non-Local (δ = 6σ)    378.8±11.4    718.5±17.7    1204.1±29.9
Non-Local (δ = 9σ)    430.2±11.0    795.4±19.7    1271.7±32.6

Center-surround  operators  appear  to  be  the  least  variable  across  images  of  different 
individuals,  while  non-local  operators  appear  to  vary  most.  All  feature  types  but  the  center-surround 
filters  increase  in  variability  as  their  scale  increases,  which  seems  somewhat  surprising  as  one  might 
expect  more  dramatic  differences  in  individual  appearance  to  be  expressed  at  a  finer  scale. 
Nonetheless,  we  can  see  from  the  combination  of  these  results  and  the  stability  results  that  center- 
surround  and  non-local  operators  achieve  good  recognition  performance  through  different  means. 
Center-surround  operators  are  not  so  variable  from  person  to  person,  but  make  up  for  it  with  an 
extremely  stable  response  to  individual  faces  despite  significant  transformations.  In  contrast,  non-local 
operators  lack  the  full  stability  of  center-surround  operators,  but  appear  to  make  up  for  it  by  being 
much more variable in response across the population of faces. Coming dead last of course are the
local-oriented  features,  which  appear  to  be  neither  stable  nor  variable  in  a  useful  fashion. 

Discussion 

The  results  of  our  stability  analysis  of  differential  operators  reveal  two  main  findings.  First,  the  same 
features  that  were  discovered  to  perform  the  best  discrimination  between  intra-  and  inter-personal 
difference  vectors  in  Experiment  1  (large  center-surround  filters  and  non-local  operators)  and  to 
perform  best  in  a  simple  recognition  system  for  both  faces  and  objects  (Experiment  2)  also  display  the 
greatest  combination  of  stability  and  variability  when  confronted  with  ecologically  relevant  face 
transforms. Second, the limited stability of local oriented operators suggests that they may not
provide  the  most  useful  features  for  handling  these  image  transforms. 

Conclusions 

We  have  noted  the  emergence  of  large  center-surround  and  non-local  operators  as  tools  for  performing 
object  discrimination  using  simple  features,  and  found  that  both  of  these  operators  provide  for  good 
stability  of  response  across  a  range  of  different  transforms.  These  structures  differ  from  receptive  field 
forms  known  to  support  sparse  encoding  of  natural  scenes,  yet  seem  to  provide  a  better  means  of 
discriminating  between  individual  objects  and  providing  stable  responses  to  image  transforms.  We  take 
this  to  mean  that  the  constraints  that  govern  information-theoretic  approaches  to  image  representation 
may  not  necessarily  be  useful  for  developing  representations  that  can  support  the  recognition  of  objects 
in  images. 

In  the  specific  context  of  faces,  do  large  center-surround  fields  or  non-local  comparators 
present a viable alternative for performing efficient face recognition? At present, the answer to this
question  is  no.  Complex  (and  truly  global)  features  such  as  eigenface  (Turk  and  Pentland  1991)  bases 
provide  for  higher  levels  of  recognition  performance  than  we  expect  to  achieve  using  these  far  simpler 
features.  We  note  however  that  the  discovery  of  a  useful  vocabulary  of  low-level  features  may  aid 
global  recognition  techniques  like  eigenface-based  systems.  One  could  easily  compute  PCA  bases  on 
non-local  and  center-surround  measurements  rather  than  pixels.  The  added  stability  of  these  operators 
may  help  increase  recognition  performance  greatly. 

The  larger  question  at  stake,  however,  does  not  only  concern  face  recognition,  despite  it  being 
our  domain  of  choice  for  the  current  study.  Of  greater  interest  than  building  a  face  recognition  engine 
is learning how one might construct invariance to relevant image transforms given some set of simple
measures. Little is known about how one moves from highly selective, small receptive fields in V1 to
the  large  receptive  fields  in  IT  that  demonstrate  great  invariance  to  stimulus  manipulations  within  a 
particular  class.  We  introduce  a  particular  computation,  the  dissociated  dipole,  that  represents  one 
example  of  a  very  broad  space  of  alternative  computations  by  which  limited  amounts  of  invariance 
might be achieved. The idea of non-local computation is not new, and our proposal draws support from
several studies of human perception. Indeed, past psychophysical studies of the long-range processing of pairs of lines suggest
the  existence  of  similarly  structured  “coincidence  detectors”  which  enact  non-local  comparisons  of 
simple  stimuli  (Morgan  and  Regan  1987;  Kohly  and  Regan  2000).  Further  work  exploring  non-local 
processing  of  orientation  and  contrast  has  more  recently  given  rise  to  the  idea  of  a  “cerebral  bus” 
shuttling  information  between  distant  points  (Danilova  and  Mollon  2003).  These  detectors  could 
contribute to shape representation, as demonstrated by Burbeck’s idea of encoding shapes via medial
“cores”  built  by  integrating  information  across  disparate  “boundariness”  detectors  (Burbeck  and  Pizer 
1995). 

Our  overarching  goal  in  this  work  is  to  redirect  the  study  of  non-classical  receptive  field 
structures  towards  examining  the  possibility  that  object  recognition  may  be  governed  by  computations 
outside  the  realm  of  traditional  multi-scale  pyramids,  and  subject  to  different  constraints  than  those 
that guide formulations of image representation based on information theory. The road from V1 to IT
(and  computationally  speaking,  from  Gabors  and  Gaussian  derivatives  to  eigenfaces)  may  contain 
many  surprising  image  processing  tools. 

Even  within  the  realm  of  dissociated  dipoles  there  are  many  parameters  to  explore.  For 
example,  the  two  lobes  need  not  be  isotropic,  or  be  of  equal  size  and  orientation.  The  lobes  could 
easily  take  the  form  of  Gaussian  derivatives  rather  than  Gaussians.  Given  that  there  are  many  more 
parameters  that  could  be  introduced  to  the  simple  DOG  framework,  it  is  possible  that  even  better 
invariance could be achieved by introducing more degrees of structural freedom. The point is that
expanding  our  consideration  to  non-local  operators  opens  up  a  large  space  of  possible  filters,  and 
systematic  exploration  of  this  space,  while  difficult,  may  be  very  rewarding. 